[None][fix] guard CUDA graph capture against ADP asymmetric batch-None deadlock#14986
longcheng-nv wants to merge 1 commit into
…e deadlock

_capture_generation_cuda_graphs iterates batch sizes in reverse order (largest first, so smaller graphs can reuse the memory pool). Under attention-DP (enable_attention_dp=true), KV-cache capacity can differ across TP ranks. _create_cuda_graph_warmup_request returns None on ranks that lack space for the requested batch size.

Without a cross-rank check, ranks where batch=None silently `continue` while other ranks enter forward() containing tp_comm collectives (NCCL allreduce / DEP alltoall). The skipped ranks never reach the collective, causing a permanent distributed deadlock. The process hangs before producing any Python output (run.log = 0 bytes) and must be killed externally.

The identical scenario is already guarded in _general_warmup_impl and _run_autotuner_warmup via _assert_all_tp_ranks_have_warmup_batch. Apply the same pattern to _capture_generation_cuda_graphs:
- If tp_size <= 1: single-rank path, safe to skip silently (unchanged).
- If tp_size > 1: call _assert_all_tp_ranks_have_warmup_batch to detect asymmetry and raise RuntimeError before entering forward(), then skip only when all ranks agree the batch is None.

Add unit tests covering the three cases: asymmetric None raises, all-None skips gracefully, all-valid proceeds to forward(). A fourth structural test asserts the guard call is present in the source to catch future regressions.

Made-with: Claude Code
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: longcheng-nv <243710427+longcheng-nv@users.noreply.github.com>
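To make the failure mode concrete, here is a small pure-Python simulation of which batch sizes would split the ranks. It is only an illustration: `has_warmup_batch` and `asymmetric_batch_sizes` are hypothetical helpers, and the rule "a rank gets a batch only if its KV capacity is at least the batch size" is an assumption standing in for whatever `_create_cuda_graph_warmup_request` actually checks.

```python
from typing import List


def has_warmup_batch(kv_capacity: int, batch_size: int) -> bool:
    # Hypothetical stand-in for "the warmup request is not None" on one rank.
    # Assumption: a rank can build the batch only if its KV cache has room.
    return kv_capacity >= batch_size


def asymmetric_batch_sizes(kv_capacity_per_rank: List[int],
                           batch_sizes: List[int]) -> List[int]:
    """Return batch sizes where some ranks would skip while others run forward()."""
    risky = []
    # Largest first, mirroring the reverse iteration order of the capture loop.
    for bs in sorted(batch_sizes, reverse=True):
        flags = [has_warmup_batch(cap, bs) for cap in kv_capacity_per_rank]
        # Mixed True/False means the ranks disagree: the deadlock precondition.
        if any(flags) and not all(flags):
            risky.append(bs)
    return risky


if __name__ == "__main__":
    # Rank 0 has room for 32 requests, rank 1 only for 8.
    print(asymmetric_batch_sizes([32, 8], [1, 2, 4, 8, 16, 32]))  # [32, 16]
```

With equal capacity on every rank the list is empty, which corresponds to the "all ranks agree" case where skipping is safe.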
📝 Walkthrough

This PR prevents potential deadlocks in CUDA graph generation when TP ranks have asymmetric KV-cache capacity. The engine now asserts batch consistency across ranks before skipping warmup, and comprehensive tests validate the asymmetric detection, graceful all-None skip, and normal all-valid progression paths.

Changes

Deadlock Prevention in CUDA Graph Warmup
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
/bot run --disable-fail-fast

PR_Github #52252 [ run ] triggered by Bot. Commit:

PR_Github #52252 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #52584 [ run ] triggered by Bot. Commit:
Summary
- `_capture_generation_cuda_graphs` in `PyTorchModelEngine` lacks the cross-rank batch-None check that `_general_warmup_impl` and `_run_autotuner_warmup` already have.
- `enable_attention_dp=true` + CUDA graph batch sizes ≥ 16 → permanent distributed deadlock during engine warmup. The process hangs before producing any Python output (run.log = 0 bytes), 100% reproducible.
- This PR adds the `_assert_all_tp_ranks_have_warmup_batch` guard (already present in the two other warmup paths) to `_capture_generation_cuda_graphs`.

Root Cause
`_capture_generation_cuda_graphs` iterates batch sizes in reverse order (largest first, so smaller graphs can reuse the memory pool). Under attention-DP, KV-cache capacity differs per TP rank. `_create_cuda_graph_warmup_request` returns `None` on ranks that lack space. Without a cross-rank check:
- Ranks where `batch is None` silently `continue`.
- Other ranks enter `forward()` containing `tp_comm` collectives (NCCL allreduce / DEP alltoall).
- The skipped ranks never reach the collective, so the job deadlocks permanently.

The two other warmup paths already guard this with `_assert_all_tp_ranks_have_warmup_batch` (added in an earlier PR). This PR closes the remaining gap.

Fix
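A minimal sketch of the guard pattern, not the actual patch: the signature of `_assert_all_tp_ranks_have_warmup_batch` is not shown in this PR text, so `assert_all_ranks_agree`, `should_capture`, and the `gather_flags` callback (standing in for a cross-rank gather of `batch is not None`) are all hypothetical names. The point it tries to show is that every rank takes part in the agreement check, including ranks that do have a batch, so the check itself stays symmetric.

```python
from typing import Callable, List, Optional


def assert_all_ranks_agree(flags: List[bool], batch_size: int) -> None:
    # Hypothetical stand-in for the cross-rank guard: raise if only some
    # ranks have a warmup batch for this batch size.
    if any(flags) and not all(flags):
        missing = [rank for rank, ok in enumerate(flags) if not ok]
        raise RuntimeError(
            f"warmup batch for batch_size={batch_size} is None on ranks "
            f"{missing} but not on the others; refusing to enter forward()")


def should_capture(batch: Optional[object], tp_size: int, batch_size: int,
                   gather_flags: Callable[[bool], List[bool]]) -> bool:
    """Decide whether this rank runs forward() for the given batch size."""
    if tp_size <= 1:
        # Single-rank path: no collectives to mismatch, skipping is safe.
        return batch is not None
    # Every rank contributes its flag, whether or not it has a batch.
    flags = gather_flags(batch is not None)
    assert_all_ranks_agree(flags, batch_size)
    # Reaching this line means all ranks agree, so skipping (or running) is safe.
    return batch is not None
```

In a capture loop this would be used as `if not should_capture(...): continue`, placed before any call that contains collectives.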
Tests
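Added `tests/unittest/_torch/executor/test_cuda_graph_capture_adp_guard.py` with four unit tests:
- Asymmetric None across ranks: raises `RuntimeError` before `forward()`.
- All ranks None: skips gracefully.
- All ranks valid: `forward()` called normally.
- Structural check: asserts the guard call is present in the source to catch future regressions.

Test plan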
- `pytest tests/unittest/_torch/executor/test_cuda_graph_capture_adp_guard.py` (no GPU needed)
  → 4/4 passed (1.61 s wall time on 8×B300 host)
- `enable_attention_dp=true` + `cuda_graph_config.batch_sizes=[1,2,...,32]` no longer deadlocks at warmup
  → Verified via DSv4 Pro DEP GSM8K eval (8×B300, MTP=3, GVR ON, BS up to 32): engine initialized cleanly in 452 s, inferred all 1319 problems without hang. Accuracy: 96.51% average (flexible-extract 96.51, strict-match 96.51), matching TEP mode (96.63%) within ±0.5 pp statistical error.
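For readers who want to see how the three behavioural cases can be checked without a GPU, the sketch below mocks `forward()` with `unittest.mock.MagicMock`. It does not reproduce the tests in this PR: `capture_step` is a hypothetical single iteration of a guarded loop, and `rank_flags` simulates the gathered per-rank "has a batch" flags.

```python
from typing import List, Optional
from unittest.mock import MagicMock


def capture_step(batch: Optional[object], rank_flags: List[bool], forward) -> str:
    # Hypothetical single iteration of a guarded capture loop.
    if any(rank_flags) and not all(rank_flags):
        raise RuntimeError("warmup batch is None on some TP ranks only")
    if batch is None:
        return "skipped"
    forward(batch)
    return "captured"


def test_asymmetric_none_raises():
    forward = MagicMock()
    try:
        capture_step(None, [False, True], forward)
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected RuntimeError")
    # The error must fire before forward() is ever reached.
    forward.assert_not_called()


def test_all_none_skips():
    forward = MagicMock()
    assert capture_step(None, [False, False], forward) == "skipped"
    forward.assert_not_called()


def test_all_valid_calls_forward():
    forward = MagicMock()
    batch = object()
    assert capture_step(batch, [True, True], forward) == "captured"
    forward.assert_called_once_with(batch)
```

Because the collective and the model are both replaced by plain Python objects, tests of this shape run in well under a second on any host.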
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests